Add trim_galore test fixtures (for MultiQC#3538) #377
Merged
Test fixtures for the new MultiQC `trim_galore` module proposed in MultiQC/MultiQC#3538. These are real Trim Galore v2.1.0-beta.5 outputs:

- `sample_R1.fastq.gz_trimming_report.json` — single-end (10K Illumina, long adapter-length distribution including the 1-bp tail typical for the `--stringency 1` default)
- `BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json` — paired-end pair (BS-seq 10K, with `pair_validation` populated and short adapter-length tails — useful coverage for the PE code path)

Schema reference: `schema_version` 1, documented in the upstream issue thread at MultiQC/MultiQC#3529.
FelixKrueger added a commit to FelixKrueger/MultiQC that referenced this pull request Apr 27, 2026
The fixtures originally landed in `tests/data/modules/trim_galore/` of the main repo, but `test_modules_run.py` resolves test data via `<repo>/test-data/data/modules/<module>/` (a separate sibling repo, MultiQC/test-data). Moving them there in MultiQC/test-data#377.
ewels (Member) reviewed May 11, 2026 and left a comment:
These are still up to date, post-release right?
Nit: Please could you include the associated log files as well? I want to make sure that we don't show sections for both TrimGalore and Cutadapt together, so it would help to test for that blocking effect.
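One way to exercise the blocking effect asked about here: the squashed MultiQC commit referenced in this thread describes an `exclude_contents_re` on the cutadapt text-report pattern that matches "Trim Galore version: " followed by a major version of 2 or higher. The Python sketch below illustrates that rule only. The regex and the name `cutadapt_should_skip` are assumptions made for illustration, not the pattern committed to MultiQC, and the report lines are invented examples.

```python
import re

# Hypothetical pattern, written from the commit description rather than copied
# from MultiQC: a major version of 2-9, or any multi-digit major such as 10.
TG_V2_PLUS = re.compile(r"Trim Galore version: (?:[2-9]|[1-9]\d+)\.")


def cutadapt_should_skip(report_text: str) -> bool:
    """Return True if a text report looks like Trim Galore v2.x or newer."""
    return TG_V2_PLUS.search(report_text) is not None


# Invented example lines, for illustration only.
examples = [
    "Trim Galore version: 2.1.0-beta.5",  # v2.x: skipped by cutadapt
    "Trim Galore version: 0.6.10",        # legacy: still parsed by cutadapt
    "This is cutadapt 4.0",               # no Trim Galore line: unaffected
]
for line in examples:
    print(line, "->", cutadapt_should_skip(line))
```

A fixture set with both the JSON and the text log would let a test assert that only the `trim_galore` section appears for v2.x samples.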
ewels added a commit to MultiQC/MultiQC that referenced this pull request May 11, 2026
* Add native MultiQC module for Trim Galore v2.x (Oxidized Edition)

Closes #3529.

Trim Galore v2.x emits a structured `*_trimming_report.json` (schema v1) alongside the legacy `*_trimming_report.txt` report. The text report still carries the `"This is cutadapt"` shim for backwards compatibility, so the existing `cutadapt` module path keeps working unchanged. This new module parses the JSON natively, which:

- Gets the Software Versions table right ("Trim Galore X.Y.Z" instead of the misleading "Cutadapt 4.0" backwards-compat shim)
- Surfaces TrimGalore-specific stats not available from Cutadapt output (RRBS truncation counts, poly-A/G trimming, paired-end pair-validation outcomes — the latter two are wired through to the data file but not yet plotted; happy to add follow-up sections)

## What's plotted

- General stats columns: % adapter, % pass, % q-trimmed, total reads (hidden), total bp written (hidden)
- Filtered reads bargraph: passing / too_short / too_long / too_many_n / discarded_untrimmed
- Adapter length distribution linegraph (per sample, per adapter when a sample has more than one)

## Sample-name handling

PE TrimGalore reports list both R1 and R2 in `input_filenames` (both JSONs do — Trim Galore preserves the pair context). The parser uses the JSON's `read_number` field to pick the correct filename, so R1 and R2 become distinct samples.

## Coexistence with the cutadapt module

Both modules will discover their respective files (text vs JSON). With both enabled, each sample appears in both modules' general-stats columns. Users who want to disable the cutadapt path on TrimGalore samples can:

```yaml
disable_modules:
  - cutadapt
```

Documented in the module's class docstring.

## Test fixtures

`tests/data/modules/trim_galore/` contains:

- `sample_R1.fastq.gz_trimming_report.json` — SE example (10K Illumina)
- `BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json` — PE example (BS-seq 10K, with `pair_validation` populated)

Verified locally: `multiqc -m trim_galore tests/data/modules/trim_galore/` produces 3/3 reports parsed (1 SE + 2 PE), all sections rendered, data file written to `multiqc_data/multiqc_trim_galore.txt`.

## Schema reference

JSON schema v1 is documented in the upstream issue thread (linked in the issue body). The parser version-gates on `schema_version: 1` and warns + skips files with a different version, so a future schema bump won't silently misparse.

## Status

Marking as draft. Initial scope is intentionally focused — happy to extend for poly-A/G trimming sections, pair-validation visualisation, RRBS-specific stats, or anything else the maintainers want before a final review pass.

* Apply prettier + ruff format from prek hooks

* Move test fixtures to MultiQC/test-data fork (companion PR)

The fixtures originally landed in `tests/data/modules/trim_galore/` of the main repo, but `test_modules_run.py` resolves test data via `<repo>/test-data/data/modules/<module>/` (a separate sibling repo, MultiQC/test-data). Moving them there in MultiQC/test-data#377.

* Address PR review feedback on trim_galore module

- Move write_data_file to end of __init__ and flatten payload to scalar columns so multiqc_trim_galore.txt is machine-readable
- Call add_software_version unconditionally; bail on ignored samples inside the parse loop
- Drop module-level docstring and unicode divider comments per project style; tone down class docstring
- Drop redundant _strip_fastq_suffix helper in favour of clean_s_name
- Add SampleGroupingConfig so PE pairs collapse cleanly with table_sample_merge (weighted-average percentages, sum counts)
- Remove hardcoded bargraph colours; use uniform composite keys in the adapter-length plot and continue on zero-adapter samples
- Drop % Q-trim precision to {:,.1f}; surface tg_total_reads by default
- Bump schema_version mismatch to log.error with explicit guidance
- Simplify search_patterns.yaml (drop contents/num_lines shim)
- Group trim_galore adjacent to cutadapt in config_defaults.yaml
- Revert CHANGELOG.md entry (generated from PR titles)

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Log debug message when JSON tool field is not Trim Galore

Helps diagnostics if a non-TrimGalore JSON happens to match the filename glob.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Auto-suppress cutadapt module for Trim Galore v2.x text reports

The cutadapt module's text-report pattern matches any file containing "This is cutadapt", which also catches the backwards-compatibility shim that Trim Galore v2.x writes alongside its native JSON report. Result: every v2.x sample shows up twice — once via cutadapt (as a misleading "Cutadapt 4.0"), once via trim_galore. Telling users to disable cutadapt globally also kills parsing of genuine cutadapt logs and legacy Trim Galore v0/v1 reports, so it isn't a real fix.

Add an exclude_contents_re to the cutadapt text-report pattern matching "Trim Galore version: " followed by a major version of 2 or higher. v0.x / v1.x text reports continue to be picked up by cutadapt; v2.x text reports are skipped (the sibling JSON is handled by trim_galore); pure cutadapt logs are unaffected.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add Pair Validation, Poly-A/G, and RRBS sections to trim_galore

Surface the schema-v1 fields that were already in the data file but not plotted: pair_validation, poly_a_trimming, poly_g_trimming, rrbs. Each is a small table with sensible gating.

- Pair Validation: collapses R1/R2 (pair-level data is identical between them), drops rows where less than 0.1% of pairs were affected.
- Poly-A/G and RRBS: per-row gating, samples with zero counts are hidden.
- All three sections show a Bootstrap alert listing dropped samples, with long lists wrapped in <details> (bases2fastq pattern).
- Defensive try/except around length_distribution int-coercion so a malformed key downgrades to a debug log rather than crashing the run.
- Data file flattening extended to all of pair_validation, poly_a_trimming, poly_g_trimming and rrbs blocks (25 columns total).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Add support for sample grouping

* Add explicit_groups for deterministic tool-derived sample grouping

Extend SampleGroupingConfig with an `explicit_groups` parameter that lets modules supply their own ground-truth groups instead of relying on the user's `table_sample_merge` name patterns. Useful when the tool output already tells you which samples are related — paired-end trimmers that emit both filenames in each report, lane manifests, replicate IDs, etc. The framework silently ignores entries with a single member so callers don't need to filter them out themselves.

Wire trim_galore to use this. Each JSON's `tuple(input_filenames)` is a stable pair key (byte-identical between R1 and R2 of the same pair).

Auto-grouping applies to:

- General Stats table — framework path with expand-to-see-individuals
- Pair Validation table — manual collapse keyed on the same pair_key

Filtered Reads bargraph, Poly-A/G and RRBS tables stay per-read because R1 and R2 stats there can legitimately differ.

Users with `table_sample_merge` configured layer name-pattern grouping on top of the auto-derived pairs. The `trim_galore_config.auto_group_pairs: false` flag opts out of auto-grouping entirely.

Replaces the earlier `_apply_grouping` helper that relied on `config.table_sample_merge` to pre-aggregate filtered_reads / poly / RRBS — those now stay per-read regardless of grouping config.

Docs updated: developer guide gets a worked example for module authors with authoritative pair info; user-facing customisation page describes the auto-grouping behaviour and the opt-out.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Simplify trim_galore module after review pass

- Store pair_key_to_samples and pair_display_by_key on `self` so _pair_validation_plot and _derive_auto_groups drop their extra parameters
- Add a small _add_filtered_section helper that wraps the plot+description+alert+add_section pattern, collapsing three near-identical 11-line blocks
- Simplify the gen_stats type annotation from a quadruple Union workaround to `Dict[str, Dict[ColumnKey, Any]]` plus a single `cast(Any, ...)` at the addcols call site
- Drop the unused `Union` import that fell out of the above
- Tighten narrative comments per CLAUDE.md (keep only WHY)

Net: -30 lines, no behaviour change. Lint / mypy / module tests all clean across the three grouping scenarios (default, with table_sample_merge, and opt-out).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Bit of clean up
* Manual review of docs
* Schema version: assume semver, only throw error on major version bump
* Remove some excessively cautious code
* Make code way less defensive still. If the data is that badly mangled, I'd rather it throw an exception instead of silently default to fake numbers
* Better docstring / docs
* Tidy up descriptions / helptext a bit2

---------

Co-authored-by: Phil Ewels <phil.ewels@seqera.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
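The pair-key idea from the `explicit_groups` commit above can be sketched in plain Python. This is an illustration only: `derive_pair_groups` is a hypothetical helper, not part of MultiQC's API, and the sample names and filenames are invented. It assumes each parsed report is a dict carrying an `input_filenames` list, as the commit message describes.

```python
from collections import defaultdict
from typing import Dict, List, Tuple


def derive_pair_groups(reports: Dict[str, dict]) -> Dict[Tuple[str, ...], List[str]]:
    """Group sample names by tuple(input_filenames), the pair key.

    `reports` maps a sample name to its parsed JSON report (assumed shape).
    """
    by_key: Dict[Tuple[str, ...], List[str]] = defaultdict(list)
    for s_name, data in reports.items():
        by_key[tuple(data["input_filenames"])].append(s_name)
    # The commit says the framework ignores single-member groups; this sketch
    # mimics that by dropping them here.
    return {key: sorted(names) for key, names in by_key.items() if len(names) > 1}


# Invented inputs: one paired-end pair and one single-end sample.
reports = {
    "BS-seq_10K_R1": {"input_filenames": ["BS-seq_10K_R1.fastq.gz", "BS-seq_10K_R2.fastq.gz"]},
    "BS-seq_10K_R2": {"input_filenames": ["BS-seq_10K_R1.fastq.gz", "BS-seq_10K_R2.fastq.gz"]},
    "sample_R1": {"input_filenames": ["sample_R1.fastq.gz"]},
}
groups = derive_pair_groups(reports)
print(groups)
```

Because R1 and R2 of a pair list the same filenames in the same order, the tuple is identical for both reports, so no name-pattern matching is needed to pair them.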
Companion PR to MultiQC/MultiQC#3538, which adds a native MultiQC module for Trim Galore v2.x.

This PR drops the test fixtures the new module needs into `data/modules/trim_galore/`. Real Trim Galore v2.1.0-beta.5 JSON outputs:

- `sample_R1.fastq.gz_trimming_report.json` — single-end (10K Illumina, long adapter-length distribution including the 1-bp tail typical for the default `--stringency 1`)
- `BS-seq_10K_R{1,2}.fastq.gz_trimming_report.json` — paired-end pair (BS-seq 10K, with `pair_validation` populated and short adapter-length tails — covers the PE code path)

Schema reference: `schema_version: 1`, documented in the upstream MultiQC issue thread.

The MultiQC PR's `test_modules_run.py::test_all_modules[trim_galore-…]` and `test_ignore_samples[trim_galore-…]` checks fail until this PR merges (they look for `test-data/data/modules/trim_galore/`). Happy to coordinate merge order — most natural is to merge this first, then unblock the MultiQC PR's CI.
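A fixture's `schema_version` could be gated roughly as follows. This is a sketch that assumes semver-style versions where only the major component matters, as one of the MultiQC commits in this thread puts it. `schema_major` and `is_supported` are hypothetical names, the JSON snippet is a toy stand-in for a real fixture, and the actual module may be structured differently.

```python
import json

SUPPORTED_MAJOR = 1  # the fixtures in this PR declare schema_version 1


def schema_major(version) -> int:
    """Return the major component of an int (1) or a semver-like string ("1.2.0")."""
    return int(str(version).split(".")[0])


def is_supported(report: dict) -> bool:
    """True when the report's schema major version matches SUPPORTED_MAJOR."""
    return schema_major(report["schema_version"]) == SUPPORTED_MAJOR


# Toy stand-in for a parsed fixture; real reports carry many more fields.
report = json.loads('{"schema_version": 1}')
print(is_supported(report))
```

Comparing only the major component means a minor or patch bump in the schema would still be parsed, while a major bump would be rejected.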